{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M2.01\n",
    "\n",
    "The aim of this exercise is to make the following experiments:\n",
    "\n",
    "* train and test a support vector machine classifier through cross-validation;\n",
    "* study the effect of the parameter gamma of this classifier using a\n",
    "  validation curve;\n",
    "* use a learning curve to determine the usefulness of adding new samples in\n",
    "  the dataset when building a classifier.\n",
    "\n",
    "To make these experiments we first load the blood transfusion dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
    "data = blood_transfusion.drop(columns=\"Class\")\n",
    "target = blood_transfusion[\"Class\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we use a support vector machine classifier (SVM). In its most simple\n",
    "form, a SVM classifier is a linear classifier behaving similarly to a logistic\n",
    "regression. Indeed, the optimization used to find the optimal weights of the\n",
    "linear model are different but we don't need to know these details for the\n",
    "exercise.\n",
    "\n",
    "Also, this classifier can become more flexible/expressive by using a so-called\n",
    "kernel that makes the model become non-linear. Again, no understanding regarding\n",
    "the mathematics is required to accomplish this exercise.\n",
    "\n",
    "We will use an RBF kernel where a parameter `gamma` allows to tune the\n",
    "flexibility of the model.\n",
    "\n",
    "First let's create a predictive pipeline made of:\n",
    "\n",
    "* a [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\n",
    "  with default parameter;\n",
    "* a [`sklearn.svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n",
    "  where the parameter `kernel` could be set to `\"rbf\"`. Note that this is the\n",
    "  default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "model = make_pipeline(StandardScaler(), SVC())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate the generalization performance of your model by cross-validation with\n",
    "a `ShuffleSplit` scheme. Thus, you can use\n",
    "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)\n",
    "and pass a\n",
    "[`sklearn.model_selection.ShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)\n",
    "to the `cv` parameter. Only fix the `random_state=0` in the `ShuffleSplit` and\n",
    "let the other parameters to the default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_validate, ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(random_state=0)\n",
    "cv_results = cross_validate(model, data, target, cv=cv, n_jobs=2)\n",
    "cv_results = pd.DataFrame(cv_results)\n",
    "cv_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Accuracy score of our model:\\n\"\n",
    "    f\"{cv_results['test_score'].mean():.3f} \u00b1 \"\n",
    "    f\"{cv_results['test_score'].std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As previously mentioned, the parameter `gamma` is one of the parameters\n",
    "controlling under/over-fitting in support vector machine with an RBF kernel.\n",
    "\n",
    "Evaluate the effect of the parameter `gamma` by using\n",
    "[`sklearn.model_selection.ValidationCurveDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ValidationCurveDisplay.html).\n",
    "You can leave the default `scoring=None` which is equivalent to\n",
    "`scoring=\"accuracy\"` for classification problems. You can vary `gamma` between\n",
    "`10e-3` and `10e2` by generating samples on a logarithmic scale with the help\n",
    "of `np.logspace(-3, 2, num=30)`.\n",
    "\n",
    "Since we are manipulating a `Pipeline` the parameter name is `svc__gamma`\n",
    "instead of only `gamma`. You can retrieve the parameter name using\n",
    "`model.get_params().keys()`. We will go more into detail regarding accessing\n",
    "and setting hyperparameter in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import ValidationCurveDisplay\n",
    "\n",
    "gammas = np.logspace(-3, 2, num=30)\n",
    "param_name = \"svc__gamma\"\n",
    "disp = ValidationCurveDisplay.from_estimator(\n",
    "    model,\n",
    "    data,\n",
    "    target,\n",
    "    param_name=param_name,\n",
    "    param_range=gammas,\n",
    "    cv=cv,\n",
    "    scoring=\"accuracy\",  # this is already the default for classifiers\n",
    "    score_name=\"Accuracy\",\n",
    "    std_display_style=\"errorbar\",\n",
    "    errorbar_kw={\"alpha\": 0.7},  # transparency for better visualization\n",
    "    n_jobs=2,\n",
    ")\n",
    "\n",
    "_ = disp.ax_.set(\n",
    "    xlabel=r\"Value of hyperparameter $\\gamma$\",\n",
    "    title=\"Validation curve of support vector machine\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Looking at the curve, we can clearly identify the over-fitting regime of the\n",
    "SVC classifier when `gamma > 1`. The best setting is around `gamma = 1` while\n",
    "for `gamma < 1`, it is not very clear if the classifier is under-fitting but\n",
    "the testing score is worse than for `gamma = 1`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, you can perform an analysis to check whether adding new samples to the\n",
    "dataset could help our model to better generalize. Compute the learning curve\n",
    "(using\n",
    "[`sklearn.model_selection.LearningCurveDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LearningCurveDisplay.html))\n",
    "by computing the train and test scores for different training dataset size.\n",
    "Plot the train and test scores with respect to the number of samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import LearningCurveDisplay\n",
    "\n",
    "train_sizes = np.linspace(0.1, 1, num=10)\n",
    "LearningCurveDisplay.from_estimator(\n",
    "    model,\n",
    "    data,\n",
    "    target,\n",
    "    train_sizes=train_sizes,\n",
    "    cv=cv,\n",
    "    score_type=\"both\",\n",
    "    scoring=\"accuracy\",  # this is already the default for classifiers\n",
    "    score_name=\"Accuracy\",\n",
    "    std_display_style=\"errorbar\",\n",
    "    errorbar_kw={\"alpha\": 0.7},  # transparency for better visualization\n",
    "    n_jobs=2,\n",
    ")\n",
    "\n",
    "_ = disp.ax_.set(title=\"Learning curve for support vector machine\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We observe that adding new samples to the training dataset does not seem to\n",
    "improve the training and testing scores. In particular, the testing score\n",
    "oscillates around 76% accuracy. Indeed, ~76% of the samples belong to the\n",
    "class `\"not donated\"`. Notice then that a classifier that always predicts the\n",
    "`\"not donated\"` class would achieve an accuracy of 76% without using any\n",
    "information from the data itself. This can mean that our small pipeline is not\n",
    "able to use the input features to improve upon that simplistic baseline, and\n",
    "increasing the training set size does not help either.\n",
    "\n",
    "It could be the case that the input features are fundamentally not very\n",
    "informative and the classification problem is fundamentally impossible to\n",
    "solve to a high accuracy. But it could also be the case that our choice of\n",
    "using the default hyperparameter value of the `SVC` class was a bad idea, or\n",
    "that the choice of the `SVC` class is itself sub-optimal.\n",
    "\n",
    "Later in this MOOC we will see how to better tune the hyperparameters of a\n",
    "model and explore how to compare the predictive performance of different model\n",
    "classes in a more systematic way."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}